ABSTRACT


Building and especially improving a classification kernel
represents a challenging task. The work presented in this
paper continues an already developed semi-supervised
classification approach aimed at labelling transcripts from
educational videos. We questioned whether the size of the
ground-truth data-set (Wikipedia articles) or the quality of
the keywords used in the semi-supervised labelling has a
significant impact on the accuracy metrics of the final data
model. The experiments considered three Wikipedia data-sets
of Small, Medium and Large sizes. For each data-set, three
sets of keywords were used: those offered by the video
authors, those determined by rake-nltk on the available
transcripts, and those determined on the transcripts by an
LDA model whose training and testing data are Wikipedia
articles tagged with rake-nltk. Experiments show that the
size of the data-set has little importance, while the quality
of the keywords has a more significant impact. Therefore, an
improved version of the previously developed classifier has
been obtained by improving the quality of the keywords
involved in semi-supervised training. This result paves the
way towards further improvements that may finally be deployed
within a recommender system of educational videos at the
Universitat Politècnica de València.


Keywords 


classification, educational transcripts, keywords, data-set size 


1. INTRODUCTION 

Over the last few years, the quantity of online learning
objects (LO) [6] and Massive Open Online Courses (MOOCs)
has increased dramatically, representing a real boom in
online learning. This boom of online learning resources has
caused a problem for students, as they face hundreds of
thousands of online documents. At the same time, different
approaches to discover topics and hidden semantic structures
in text have been proposed with the goal of moving forward
on topic modelling, which has been a challenging and critical
issue for information retrieval. Therefore, taking all of
this into account, topic modelling has become a trending
topic for the e-learning research community. Following that
trend, the Universitat Politècnica de València (UPV) in
Spain launched a video lectures sharing website, called
Polimedia, and a MOOC platform, called UPV[X], which is
powered by the edX MOOC platform.


Theodora Danciulescu, Stella Heras, Javier Palanca, Vicente
Julian and Cristian Mihaescu "More Data and Better Keywords
Imply Better Educational Transcript Classification?" In:
Proceedings of The 13th International Conference on
Educational Data Mining (EDM 2020), Anna N. Rafferty, Jacob
Whitehill, Violetta Cavalli-Sforza, and Cristobal Romero (eds.)
2020, pp. 381 - 387


Stella Heras, Javier Palanca,
Vicente Julian
Universitat Politecnica de Valencia
Sistemas Informaticos y Computacion
stehebar@upv.es, jpalanca@dsic.upv.es
vjulian@upv.es


Both platforms have a basic search engine allowing students
to search for videos (learning objects) by simply using a set
of keywords. Current solutions compare these keywords with
some typical metadata of the videos (title, authors, ...) and
return the set of videos that match this data. This basic
retrieval solution overlooks any semantics, which produces
incomplete results: videos that are relevant for the student
but do not include any of the keywords in their titles are
not taken into account.


The MOOCs we are using in this work consist of a set of
educational videos that have an automatic transcription of
the lectures, which is used as part of the input data for
this proposal. The motivation of this work is to use this
information to help students find better suited learning
objects, personalized to their interests, in these massive
online platforms where the number of learning objects grows
quickly and they are usually not tagged correctly.


Accordingly, this paper focuses on the improvement of this
search engine by proposing a new retrieval method that uses
a dataset extracted from Wikipedia articles and that is
trained to classify keywords based on the topic of the
available educational videos. The proposed model is an
improvement of a previous work presented in [14], where
pre-tagged Wikipedia articles were used as ground-truth. In
this work we improve this semi-supervised method by: 1)
automatically tagging Wikipedia articles and using them to
create an extended dataset for training the semi-supervised
method, and 2) proposing an improved pipeline for cleaning
the data, extracting keywords and obtaining a better
classification model that improves the precision of the
student's searches.


The rest of the paper is structured as follows: Section 2
presents some works related to the topic of this paper;
Section 3 details the proposed approach; Section 4 presents
some experimental results; and finally, Section 5 presents
the conclusions of this work.


2. RELATED WORK 


Correct keyword extraction has been a recurrent problem over
the last few years, and different works have tried to solve
it using different approaches. In the end, the idea is to
have a solid set of words that concisely represent the
content of a text (in this case the content of a learning
object).


Most recent document-oriented approaches to keyword
extraction use natural language processing (NLP) techniques
mainly based on machine learning algorithms and statistical
methods. One of the most well-known approaches is the work
presented in [17], where the authors propose the use of
Support Vector Machines as a way to extract the most
important keywords.


On the other hand, the work in [9] presents a solution based
on a graph-based syntactic representation of text and web
documents that combines supervised and unsupervised
learning. Similarly, the work presented in [7] proposes an
unsupervised keyword extraction technique that includes
several variants of the conventional TF-IDF model with
reasonable heuristics. Other approaches, like Rapid
Automatic Keyword Extraction (RAKE) presented in [12],
employ unsupervised methods for extracting keywords that are
both domain-independent and language-independent.


The latent Dirichlet allocation (LDA) model is one of the
most used techniques to classify documents according to a
set of topics. One example is the work presented in [1], which
automatically captures the thematic patterns and identifies
emerging topics using a non-Markov on-line LDA Gibbs
sampler topic model. In the online educational field, the
LDA model has been used in works such as the one presented
in [16], where the authors use topic detection for the
analysis of the feedback submitted by students in online
courses. The work in [10] tries to solve the problem of topic
detection by identifying words that appear with high
frequency in the topic and low frequency in other topics.


Some works face the keyword extraction problem in learning
objects through other approaches such as ontologies. For
example, the work presented in [8] aims to improve the
effectiveness of retrieval and accessibility of learning
objects by integrating semantic knowledge through
domain-specific ontologies. In [4] the authors use Wikipedia
to associate learning objects with Wikipedia pages,
specifically with the topics of those pages, trying to find
relationships among learning objects.


Finally, recent work also uses intelligent algorithms and
methods to face other challenges of efficient video lecture
management, such as video shot skimming [15] and supervised
multi-class classification [5].


Unlike most related works, our method is fully
semi-supervised, with no need for a previously tagged
database or an ontology to act as ground truth for training
the models. Also, to the best of our knowledge, there are no
other intelligent systems trained to automatically classify a
Spanish database of educational videos.


3. PROPOSED APPROACH 


From a classification perspective, the first issue is to
clearly state the actual number of topics (i.e., labels) that
exist in the available transcripts. Since all transcripts
come from educational videos from UPV, the topics are given
by the domains the videos come from: biology & sciences (BS),
engineering (E) and humanities & arts (HA). The BS topic
covers bacteria, diseases, bio-engineering and bio-medicine.
The E topic covers computers, electrical, architecture, civil
and aerospace engineering. In contrast, the HA topic covers
law, arts, and social and economic subjects.


The proposed approach extends the semi-supervised method
described in a previous paper [14]. It improves the data
analysis pipeline in terms of classification accuracy on the
videos currently available in the database. As in the initial
approach, the training on Wikipedia articles uses the SVM [3]
classification algorithm with a Radial Basis Function (RBF)
kernel from the sklearn library [11]. The validation approach
uses the same two steps: 1) train on 70% of the Wikipedia
articles and cross-validate with 15%, and 2) train on the
labelled transcripts and validate on the remaining unseen 15%
of the Wikipedia articles.
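
As a minimal sketch of the 70%/15%/15% partition and of the
RBF kernel mentioned above, the following Python code uses
only the standard library. The function names, the seed and
the gamma value are illustrative assumptions, not the
settings used in the experiments; in sklearn, the
corresponding classifier would be an SVC with kernel="rbf".

```python
import math
import random


def split_70_15_15(items, seed=0):
    """Shuffle items and split them into 70% / 15% / 15% parts.

    Returns (train, cross_validation, held_out). The seed is an
    illustrative assumption that only makes the split repeatable.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 15 // 100
    train = shuffled[:n_train]
    cross_validation = shuffled[n_train:n_train + n_val]
    held_out = shuffled[n_train + n_val:]
    return train, cross_validation, held_out


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - y||^2) for two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

The kernel value is 1 for identical vectors and decays towards
0 as the vectors move apart, which is what lets the SVM
separate classes that are not linearly separable in the
original feature space.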


Internally, the semi-supervised training is performed on a
set of labelled Wikipedia articles by building a data model
that is then used for classifying the educational transcripts
and their associated keywords. The transcripts that receive
the same label as their keywords are considered correctly
labelled and are therefore added to the initial training
dataset. The newly obtained dataset is used in an iterative
semi-supervised setup for training, in an attempt to tag as
many educational transcripts as possible.
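
The iterative procedure described above can be sketched as
follows. This is a simplified illustration, not the authors'
implementation: train_fn is a hypothetical callable that
returns a prediction function, and the stopping rule (stop
when no transcript is accepted, when none remain, or after
max_iterations) is an assumption.

```python
def self_training(train_fn, labelled, transcripts, max_iterations=20):
    """Iteratively grow the training set with agreeing transcripts.

    train_fn(labelled) must return a function mapping a text to a label.
    labelled is a list of (text, label) pairs (the Wikipedia articles).
    transcripts is a list of (transcript_text, keywords_text) pairs.
    Returns the final training set, the transcripts left untagged, and
    a history of (accepted, available) counts per iteration.
    """
    labelled = list(labelled)
    remaining = list(transcripts)
    history = []
    for _ in range(max_iterations):
        if not remaining:
            break
        predict = train_fn(labelled)
        accepted, rejected = [], []
        for transcript, keywords in remaining:
            label = predict(transcript)
            # Accept only when transcript and keywords get the same label.
            if label == predict(keywords):
                accepted.append((transcript, label))
            else:
                rejected.append((transcript, keywords))
        history.append((len(accepted), len(remaining)))
        if not accepted:
            break
        labelled.extend(accepted)
        remaining = rejected
    return labelled, remaining, history
```

Each history entry has the same shape as the
"valid / available" counts reported per iteration in Table 1.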


One limitation of the previous work is that HA items were
mislabelled as E. This flaw may be caused by the fact that
videos about HA cover more varied subjects that are not so
domain-specific. Mathematics videos with proofs,
demonstrations and analysis are also not correctly labelled,
as they contain a large number of words that are not specific
to the mathematics domain. Many videos about the economy and
economic environments tend to be categorised as E, as many
explanations rely heavily on mathematics and calculus. A
positive aspect is that the classification of BS items
achieves excellent results, with no confusions for this
domain. This behaviour is expected, as this domain has many
specific terms and principles, so videos from this area are
easily classifiable.


The first step to improve the previous work [14] was to
extend the Wikipedia articles data-set used for training the
semi-supervised method. This was done progressively: we
compared the results with the previous ones and checked
manually whether the videos that had been badly classified
were now classified correctly. The decision about how many
and which categories of Wikipedia articles should be
downloaded was made by manually analysing the clustering
results from the previous work. By doing so, we obtained the
best results with three versions of the data-set: a Small
data-set (3747 Wikipedia articles), a Medium data-set (6373
Wikipedia articles) and a Large data-set (18526 Wikipedia
articles).


Secondly, we focused on the importance of relevant keywords
for obtaining a good classification result. Three sets of
keywords were produced by three different methods. The first
set was obtained using the same process as in the previous
paper [14], that is, by using the keywords provided by the
videos' authors. However, we observed inconsistencies, as
some videos do not offer keywords in their metadata. The
second set of keywords was obtained by using the rake-nltk
[13] tool to extract the keywords directly from the
transcripts' text. Finally, the third set was obtained by
using the rake-nltk tool to get keywords from the Large
data-set of Wikipedia articles and using them as training and
testing data for an LDA (Latent Dirichlet Allocation) [2]
model that extracts domain-specific tags from the
transcripts.


3.1 Training on more Wikipedia articles

Intuitively, more data should help to improve the accuracy,
but in practice this may not happen. A common question in
machine learning systems is whether the data-set is too small
for the classification problem. Proper debugging of the data
analysis pipeline should clearly point out whether the
current accuracy results can be improved by using a larger
data-set or whether other levers should be taken into
consideration.


As a first approach, we tried to detect a pattern in the
classification errors and to download the appropriate
Wikipedia articles to cover the subjects of the videos that
were mistakenly classified. Consequently, when choosing the
Wikipedia articles, not only the coverage of the topics was
taken into consideration, but also the quantity of articles
about each subject.


In response to this, in addition to the initial Small
data-set used in the previous work [14], we obtained two new
data-sets: a Medium data-set with a total of 6373 articles
(i.e., 1219 BS articles, 2737 HA articles, and 1626 E
articles), and a Large data-set with 18526 Wikipedia articles
(including 5830 BS articles, 5882 E articles and 6814 HA
articles).


3.2 Determining better keywords 

The transcripts' keywords represent a key point for the
classification algorithm, as the quality of the
classification may be directly influenced by the relevance
and quality of the keywords.


A second solution was represented by the rake-nltk tool, as
it supports the Spanish language and provides good results
for it. Rake-nltk is a domain-independent keyword extraction
tool that tries to determine key phrases in a body of text by
analysing the frequency of word appearance and its
co-occurrence with other words in the text.
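
To illustrate the frequency and co-occurrence scoring
described above, here is a simplified RAKE-style sketch in
Python. It is not the rake-nltk implementation: the stop-word
list is a tiny illustrative assumption and the function names
are hypothetical. Candidate phrases are the word runs between
stop words and punctuation, each word is scored as its degree
(co-occurrences inside phrases, counting itself) divided by
its frequency, and a phrase scores the sum of its word scores.

```python
import re
from collections import defaultdict

# Tiny illustrative Spanish stop-word list (an assumption; real
# tools rely on a complete list).
STOPWORDS = {"el", "la", "los", "las", "de", "del", "y", "en",
             "un", "una", "es", "se", "que", "por", "con", "para"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into word runs delimited by stop words or punctuation."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"\w+", fragment):
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text, stopwords=STOPWORDS):
    """Return (phrase, score) pairs sorted from highest to lowest score."""
    phrases = candidate_phrases(text, stopwords)
    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / frequency[w] for w in frequency}
    scored = {" ".join(p): sum(word_score[w] for w in p)
              for p in set(phrases)}
    return sorted(scored.items(), key=lambda item: (-item[1], item[0]))
```

For the text "La memoria del programa es limitada. El programa
usa memoria virtual." the highest-ranked phrase is "programa
usa memoria virtual". The example also shows why long phrases
tend to dominate the ranking, which is the limitation
discussed later in this section.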


After trying to classify the videos into 3 clusters (BS, E
and HA) using three differently sized data-sets (i.e., Small,
Medium and Large) for training and two different methods for
assigning keywords to each transcript (the keywords manually
provided by the authors and the keywords extracted with
rake-nltk), we finally used a third method that provides more
domain-specific keywords for every transcript: we used LDA as
the core logic of a transcript keyword recommendation system,
and used rake-nltk to provide keywords for Wikipedia articles
in order to obtain training and testing data.


The transcripts and the keywords from the metadata (i.e., the
authors' keywords) do not represent a valid data-set: the
words used as keywords are either ambiguous or too
name-specific, and they often induce classification errors.


The limitation of the second method is that the keywords
provided by rake-nltk from the transcripts were long and
contained numerous phrases without a focus on the essential
subject of the video, which also caused classification errors
in some cases. So, a third solution was needed, combining the
rake-nltk tool with the LDA model: rake-nltk is used to
extract keywords from the Wikipedia articles, resulting in a
labelled data-set that later serves as the training and
testing data-set for the LDA model, which extracts
domain-specific keywords from the transcripts.


The second approach provides new keywords for every
transcript by using rake-nltk. The keywords extracted with
this tool were also pre-processed by eliminating stop words
and lowercasing all the letters. However, this method still
has one disadvantage: the extracted keywords are long phrases
that are not necessarily very domain-specific. Moreover, the
extracted phrases are ambiguous in some cases, lacking the
essential subject of the transcript. This is most likely
caused by the fact that the transcripts are not always
subject-focused: they usually include an introduction about
the teacher and about the subject in general, and many
examples are provided. Hence, there is a broad set of words
that may induce errors.


The third approach uses the rake-nltk tool not for extracting
keywords directly from our transcripts, but for extracting
keywords for each article of the Large Wikipedia data-set
(18526 articles). The Wikipedia articles tagged with rake-nltk
are then used as training data for assigning keywords to the
video transcripts by means of LDA.


Figure 1 presents in detail the data analysis pipeline for the
third method of providing keywords, which is the method
described in the rest of this section.


Figure 1: Pipeline for extracting keywords from Wikipedia articles
[Flowchart: 18546 Spanish Wikipedia articles + their 18546
titles -> eliminate stop words and punctuation, keep the
singular form of the words -> use rake-nltk for Spanish to
append tags to each item from the data-set -> create a set
with unique appearances of all tags -> count how many items
got each tag appended -> keep the first 5000 most occurring
tags -> keep for every article only the tags that are in the
top 5000 -> create the TF-IDF matrices, 80/20 split -> find
the best LDA model (the number of topics that provides the
lowest perplexity score) -> recommending algorithm by score
and threshold]


The following steps were followed for obtaining the
domain-specific transcript tag recommendation algorithm
utilizing LDA:

Create a balanced and large data-set of Wikipedia
articles in Spanish. A balanced data-set is supposed to
contain enough BS articles to obtain a set of keywords for
BS, enough E articles to get a set of keywords for this
domain, and, most importantly, enough HA articles to form a
set of tags for this domain, too. The difficult part was to
get a good set of keywords for the HA domain, as this cluster
covers a wide range of fields like Economy, Law, Arts,
Architecture, Language learning, Politics, Social Sciences,
Philosophy, Psychology and basically anything that does not
fit in the other two clusters.

Clean the text from the downloaded Wikipedia articles
by lowercasing the text, removing undesirable marks and stop
words, and using the singular form of each word. Append tags
to each Wikipedia article using the rake-nltk tool and clean
these tags in the same way (lowercase the text, remove
undesirable marks, remove stop words, use the singular form of
each word). For better results, tags are also extracted from
the titles of the Wikipedia articles. That means that we pull
tags for 18526 x 2 items.

Add all these tags to a set in order to have only unique
appearances of the extracted tags.

Count how many Wikipedia articles were assigned to
each tag from the set.

Get the top 5000 most occurring tags (with fewer tags, only
the most frequent tags from each domain are kept, and in this
way the classification with the semi-supervised method is
simpler to perform with a smaller training data-set).

Keep only the top 5000 occurring tags for each Wikipedia
article.

Keep only the articles that are still labelled. After
these operations, we end up with 21743 labelled items out
of 37092 items.

Create the TF-IDF matrices by splitting the obtained
data-set 80%/20%.

Train various LDA models using the sklearn
implementation [18], assigning each of them a different number
of topics; the models are then evaluated on the test set
using the perplexity metric. By definition, the lower the
perplexity, the better the model.

Show the perplexity score for several LDA models
with different values of the n_components parameter, and
print the top words of the best LDA model (the one with the
lowest perplexity).
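
The tag-filtering and model-selection steps above can be
sketched in a few lines of Python. This is a minimal sketch
under stated assumptions: the function names and the
dictionary-based data layout are hypothetical, and text
cleaning, TF-IDF and LDA training are left out.

```python
from collections import Counter


def keep_top_tags(article_tags, top_n=5000):
    """Keep, for every article, only the top_n most frequent tags.

    article_tags maps an article id to the list of tags appended
    to it. A tag is counted once per article. Articles that end
    up without any tag are dropped, i.e. only the articles that
    are still labelled are kept.
    """
    counts = Counter(tag
                     for tags in article_tags.values()
                     for tag in set(tags))
    top = {tag for tag, _ in counts.most_common(top_n)}
    filtered = {}
    for article, tags in article_tags.items():
        kept = [tag for tag in tags if tag in top]
        if kept:
            filtered[article] = kept
    return filtered


def lowest_perplexity(perplexity_by_topics):
    """Return the number of topics whose model has the lowest perplexity.

    perplexity_by_topics maps a candidate number of topics to the
    perplexity measured on the test split.
    """
    return min(perplexity_by_topics, key=perplexity_by_topics.get)
```

Restricting the vocabulary to the most frequent tags is what
reduces the 37092 items to the labelled subset used for the
TF-IDF matrices.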

Now that we have designed the workflow, we focus on the
keywords recommendation algorithm for the transcripts, which
is based on two main aspects:


• Score: combines the probability that a document is
  assigned to a specific topic with that topic's probability
  of generating the word.


• A word is considered a relevant tag when its score is above
  a defined threshold. After testing different values, we set
  the threshold to 0.008 because, with this value, more than
  95 percent of the transcripts have recommended tags.


The sklearn LDA implementation is documented at
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
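
A minimal sketch of such a score-and-threshold recommender is
given below. It rests on two assumptions that the text does
not state explicitly: the score of a word for a topic is taken
as the product of the document's topic probability and the
topic's word probability, and a word keeps its best score over
all topics. The function and parameter names are hypothetical.

```python
def recommend_tags(doc_topic, topic_word, vocabulary, threshold=0.008):
    """Recommend tags whose score is above the threshold.

    doc_topic: list with P(topic | document), one value per topic.
    topic_word: one list per topic with P(word | topic), aligned
        with vocabulary.
    Returns (word, score) pairs sorted from highest to lowest score.
    """
    best = {}
    for p_topic, word_probs in zip(doc_topic, topic_word):
        for word, p_word in zip(vocabulary, word_probs):
            # Assumed score: product of the two probabilities.
            score = p_topic * p_word
            if score > best.get(word, 0.0):
                best[word] = score
    tags = [(word, score) for word, score in best.items()
            if score > threshold]
    return sorted(tags, key=lambda item: (-item[1], item[0]))
```

Lowering the threshold increases the share of transcripts that
receive at least one tag, at the cost of admitting less
relevant words, which is the trade-off behind the value 0.008
reported above.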


An additional advantage of obtaining keywords for every
transcript by means of rake-nltk combined with LDA is that
all the videos can be classified. In the original method,
only the videos for which the authors had provided keywords
could be taken into consideration. Now, as we offer keywords
for every transcript, all the videos with an available
transcript may be taken into consideration. An even bigger
advantage is that the training set contains articles about
well-defined domains whose subject is focused on a small
range of ideas, so the set of most frequently used tags is
very domain-specific, which is helpful for the classification
algorithm.


4. EXPERIMENTAL RESULTS 


After running the semi-supervised learning method for the
Small, Medium and Large data-sets, and with the three sets of
keywords, the best results were obtained by training the
semi-supervised method with the Small data-set of Wikipedia
articles and the keywords provided by means of rake-nltk for
obtaining training and testing data and LDA for obtaining the
proper transcript tags. The results are presented in Table 1.
This table also provides a detailed insight into the results
of the semi-supervised training process, along with the
number of transcripts added to the model in every iteration
and the classification accuracy obtained for each label. The
classification accuracy metrics are computed on the
validation data-set, which contains only data unseen during
the training step.


Iteration            Accuracy         Class             Precision  Recall  F1-score
(valid / available)
#1 (8487 / 14395)    0.92 (+/- 0.01)  Biology&Sciences  0.95       0.93    0.94
                                      Engineering       0.88       0.92    0.90
                                      Humanities&Arts   0.95       0.93    0.94
#2 (2375 / 5908)     0.96 (+/- 0.02)  Biology&Sciences  0.92       0.94    0.93
                                      Engineering       0.87       0.88    0.87
                                      Humanities&Arts   0.95       0.93    0.94
#8 (9 / 1940)        0.94 (+/- 0.05)  Biology&Sciences  0.90       0.94    0.92
                                      Engineering       0.85       0.86    0.85
                                      Humanities&Arts   0.93       0.90    0.92

Table 1: Validation scores for each iteration in the pipeline
with the Small Wikipedia articles data-set and the keywords
provided by means of rake-nltk for obtaining training and
testing data and LDA to obtain the proper transcript's tags.


Analysis of the iterative semi-supervised training process in
all nine scenarios (i.e., three data-set sizes and three
methods of obtaining the keywords) revealed several patterns.
The first observation is that the number of iterations has
low variance: irrespective of the size of the Wikipedia
data-set and the method for obtaining the keywords, the number
of iterations ranges from six to twelve. This is a clear
indication that the size of the training data-set of
Wikipedia articles does not highly influence the
semi-supervised learning. Another observation is that each
step in the semi-supervised training keeps unchanged or
slightly decreases the F1 score, while slightly increasing
the accuracy of the 10-fold cross-validation on the Wikipedia
test data-set. This shows that all experiments are consistent
and produce similar behavioural patterns in the evolution of
the accuracy, precision, recall and F1-score measures during
semi-supervised training.
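
For readers unfamiliar with the 10-fold cross-validation
mentioned above, the following sketch shows how the folds can
be generated. It is an illustration only: the text does not
say whether the folds were shuffled or stratified, so the
contiguous, unshuffled folds and the function name are
assumptions.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Fold sizes differ by at most one item; every item is used
    for testing exactly once.
    """
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size
```

The model is trained and scored once per fold, and the
reported accuracy summarises the k resulting scores.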


Table 2 presents the validation scores for all three
data-sets (i.e., Small, Medium and Large) and all three
keywords data-sets.


The first observation regarding the validation results from
Table 2 is that there are no big differences in terms of
overall accuracy and F1-scores among the three data-sets of
keywords for each training data-set. Still, the method with
rake-nltk for Wikipedia articles keywords and LDA for
obtaining transcript keywords generally has better scores
than the other two methods for cluster 2, while it usually
has lower scores for cluster 1. This pattern indicates that
improvements in classification metrics should focus on the
classes where poor results occur.
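
As a worked example of the metrics discussed above, the
sketch below computes the F1-score from precision and recall
and averages it over the classes. The "Avg F1" column of
Table 2 appears consistent with an unweighted mean of the
per-class F1-scores; this reading is an assumption, since the
text does not define the average. The sample values are the
precision and recall of the Small data-set with the original
keywords, taken from Table 2.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class):
    """Unweighted mean of the F1-scores of (precision, recall) pairs."""
    scores = [f1_score(p, r) for p, r in per_class]
    return sum(scores) / len(scores)


# (precision, recall) for classes 0, 1 and 2: Small data-set,
# original keywords (values from Table 2).
small_original = [(0.96, 0.86), (0.84, 0.86), (0.86, 0.90)]
print(round(macro_f1(small_original), 2))  # prints 0.88
```

Because the mean is unweighted, a weak class lowers the
average as much as a strong class raises it, which is why the
poorer class 1 precision is visible in the Avg F1 values.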


We further observe that scores tend to decrease slightly as
the data-set gets larger. For the Medium data-set, only the
method with rake-nltk for extracting transcript keywords
provides better results than it does with the Small data-set.
A particular result is the major score decrease for cluster 1
with the Medium data-set. This is mainly due to the imbalance
of this data-set regarding the items labelled as class 1. The
imbalance of class 1 is also signalled by the excellent
results for classes 0 and 1 in the experiment with the Large
data-set and the method with rake-nltk for Wikipedia articles
keywords and LDA for transcript keywords.


Comparing the time required to train the model with the Small
data-set and the time required with the Large data-set, for
all three sets of keywords, we noticed that the time doubled
in the worst case, even though the Large data-set is 6 times
larger than the initial one.


Besides, the method that obtains keywords by means of
rake-nltk and LDA provides a better running time for the
Small and Large data-sets than the original keywords set, as
the number of iterations is also smaller.


The method with rake-nltk and LDA transcript keywords
provides the best result for the Small data-set, though the
rake-nltk transcript keywords method has the best results for
the Medium and Large data-sets. For the method that obtains
domain-specific keywords for transcripts by using rake-nltk
to extract Wikipedia articles keywords and LDA to extract the
proper keywords for transcripts, the distribution of tags
over the 10 topics of the model is presented in Table 3. We
also notice that the 10 topics do not mix the three domains
that we are interested in: E tags are found only in topics
that do not contain tags from the other two domains, and the
same holds for BS tags and HA tags. The domain that each
topic covers can be easily noticed: the topics with indexes
1, 2, 5, 9 and 10 are focused on the HA domain, the topics
with indexes 4, 6 and 8 are focused on the E domain, and
finally, the topics 3 and 7 are focused on the BS domain.


Furthermore, the topic order shows that the three most
important topics are 4, 3 and 9, where 4 is focused on the
E domain, 3 is concentrated on BS tags, and 9 is focused on
HA tags. The fact that the three most important topics
contain one topic for each of the three domains that we are
interested in confirms that the model is suitable for our
purpose. In addition, the following 3 topics in the topic
order are also distributed equally across the three domains.


We can notice that the original keywords provided by the
authors come in different styles: some of them are too
specific (tool names that are not so common), some of them
are too ambiguous to be assigned to a domain, and some of
them provide domain-specific terms that may not be standard
enough in that domain to be correctly categorised by our
semi-supervised method, which is not trained on a massive
data-set.
Table 2: Validation scores for all data-sets and keywords sets

Data-set  Keywords                               Accuracy/Avg F1  Class  Precision  Recall  F1-score
Small     Original keywords                      0.94/0.88        0      0.96       0.86    0.91
                                                                  1      0.84       0.86    0.85
                                                                  2      0.86       0.90    0.88
          rake-nltk transcript keywords          0.94/0.88        0      0.96       0.86    0.91
                                                                  1      0.84       0.87    0.86
                                                                  2      0.86       0.90    0.88
          rake-nltk and LDA transcript keywords  0.95/0.89        0      0.91       0.92    0.91
                                                                  1      0.81       0.91    0.86
                                                                  2      0.93       0.84    0.88
Medium    Original keywords                      0.94/0.86        0      0.93       0.83    0.88
                                                                  1      0.75       0.92    0.83
                                                                  2      0.93       0.82    0.87
          rake-nltk transcript keywords          0.96/0.88        0      0.93       0.85    0.89
                                                                  1      0.80       0.90    0.85
                                                                  2      0.93       0.87    0.90
          rake-nltk and LDA transcript keywords  0.94/0.85        0      0.93       0.84    0.88
                                                                  1      0.72       0.91    0.80
                                                                  2      0.93       0.79    0.85
Large     Original keywords                      0.95/0.86        0      0.95       0.82    0.88
                                                                  1      0.80       0.90    0.85
                                                                  2      0.86       0.85    0.85
          rake-nltk transcript keywords          0.96/0.86        0      0.95       0.81    0.88
                                                                  1      0.82       0.88    0.85
                                                                  2      0.83       0.87    0.85
          rake-nltk and LDA transcript keywords  0.95/0.85        0      0.93       0.82    0.87
                                                                  1      0.77       0.90    0.83
                                                                  2      0.86       0.82    0.84


Table 3: Highest score tags per topic in the LDA model

T1     derecho / social / sociedad / política / cultura
T2     dato / software / aplicación / versión / código
T3     célula / proteína / agua / animal / forma / celular
T4     algoritmo / error / programa / memoria / ejecución
T5     mercado / precio / economía / financiero / empresa
T6     displaystyle / teoría / lógica / matemática
T7     tratamiento / cirugía / médico / paciente / síndrome
T8     ecuación / ingeniería / inteligencia / artificial
T9     política / análisi / marketing / rama / arteria
T10    industrial / industria / plano / internacional
Order  4, 3, 9, 8, 7, 5, 10, 6, 2, 1


The third method, the one that uses rake-nltk for providing
keywords to the Wikipedia articles used for training and
LDA for extracting transcript tags, provides few labels,
but they are very domain-specific. The tags produced by this
method come from a relatively small set of possible tags
(formed by the most commonly used terms in the 3 domains of
our clusters), so the most relevant tags from this set are
chosen.


This is an advantage for our semi-supervised method, as we
can provide good results with a relatively small data-set for
training. The words used as tags by this method are very
likely to be well categorised by the semi-supervised method,
as they are very common only in the area of one of the three
domains.


5. CONCLUSIONS 


This paper has presented a method that combines the
extraction of keywords from a Wikipedia data-set with the
automatic classification of learning objects using LDA to
obtain better keywords for searching educational videos. This
will allow students to find more accurate resources among
videos that have not been appropriately tagged by their
authors.


Using Wikipedia for creating a labelled data-set has allowed
us to build a balanced set of articles that has been used
to train a model for extracting keywords from educational
video transcripts. However, in future work, it would be
interesting to provide an automatic mechanism for building
balanced training data-sets.


The proposed method has been tested in a real environment,
specifically the video lectures sharing website of the
Universitat Politècnica de València, which has more than
55,000 short videos, mainly in Spanish. Results have shown
the benefits of this proposal for classifying learning
objects into categories (specifically Biology & Sciences,
Engineering and Humanities & Arts), which will help students
in their search for appropriate learning resources.


Future work should focus on improving the accuracy of the
classification, especially for the classes with poorer
results, that is, Engineering and Humanities & Arts, as
Biology transcripts are correctly classified. The obtained
classifier may be further used for labelling new videos that
may be added to the UPV Media site.